ci: shard tests to run more in parallel#2345

Merged: terrykong merged 75 commits into main from chtruong/shard-tests on May 29, 2026
Conversation

@chtruong814 chtruong814 commented Apr 26, 2026

Contributor

Summary

  • Replaces the monolithic L0 unit-test scripts with targeted shards grouped by backend marker and test domain.
    • Backend catch-all shards cover mcore, automodel, vllm, sglang, and nemo_gym markers across the unit suite.
    • Base domain shards cover models, algorithms, data, distributed, environments, and other unmarked tests.
    • Large policy/model/vLLM groups are split with pytest-shard so CI can run them in parallel.
  • Replaces the monolithic L1 GPU functional script with framework- and algorithm-focused shards for Megatron, AutoModel, SGLang, Gym, GRPO, SFT, Eval, and Other tests.
  • Updates the GitHub Actions matrices to run the new L0, L1, GB200 L1, and Lfast shard sets in parallel.
  • Adds a test approval queue so the expanded shard matrix is gated by a concurrency-managed queue instead of allowing too many CICD workflows to run at once.
  • Adds shared unit-shard setup and makes tests/run_unit.sh treat pytest exit code 5 (no tests collected) as success for shard/FAST safety.
  • Increases the timeout for some vLLM tests on H100 (the cause is not yet understood) and skips the fp8 vLLM tests because of failures on H100. This is the first time the tests run on H100. These failures are a different issue from the GB200 issues reported in "vllm generation with fp8 fails on gb200 and h100" (#2081).
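
The exit-code handling mentioned in the summary can be illustrated with a short sketch. pytest documents exit code 5 as "no tests were collected"; tests/run_unit.sh is a shell script, so the Python below only mirrors the idea, and the helper name normalize_pytest_exit_code is hypothetical.

```python
# Sketch only: mirrors the idea of treating "no tests collected" as success.
# normalize_pytest_exit_code is a hypothetical helper, not code from this PR.

NO_TESTS_COLLECTED = 5  # pytest's documented exit code when nothing is collected


def normalize_pytest_exit_code(code: int) -> int:
    """Map exit code 5 to 0 and pass every other code through unchanged."""
    return 0 if code == NO_TESTS_COLLECTED else code


if __name__ == "__main__":
    for code in (0, 1, 5):
        print(code, "->", normalize_pytest_exit_code(code))
```

Real failures (for example exit code 1) still propagate, so a shard only passes "empty" when filtering removed every test from it.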

Test approval queue

  • Adds Approve Test Queue, a scheduled/manual workflow that uses the shared FW-CI test approval queue template for CICD NeMo RL.
  • Adds a cicd-wait-in-queue gate in the main workflow for PR Lfast/L0/L1/L2 runs before container builds and test jobs proceed.
  • Concurrency is controlled with repo variables: MAX_CONCURRENCY for internal runs and MAX_CONCURRENCY_EXTERNAL for external runs, both defaulting to 3.
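
As a rough illustration of the concurrency limits above, the sketch below shows one way an admission check could work. The function may_start and its parameters are hypothetical, and it assumes internal and external runs are counted separately; the actual gate lives in the shared FW-CI template, which is not shown here.

```python
# Sketch only: a simplified admission check for a concurrency-limited queue.
# may_start and its parameters are hypothetical. It assumes `running` counts
# only the runs in the same category (internal or external) as the candidate.

DEFAULT_LIMIT = 3  # both repo variables default to 3 per the description


def may_start(
    running: int,
    is_external: bool,
    max_concurrency: int = DEFAULT_LIMIT,
    max_concurrency_external: int = DEFAULT_LIMIT,
) -> bool:
    """Return True if one more run may leave the queue."""
    limit = max_concurrency_external if is_external else max_concurrency
    return running < limit


if __name__ == "__main__":
    print(may_start(running=2, is_external=False))  # below the default limit
    print(may_start(running=3, is_external=False))  # at the default limit
```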

SGLang default

  • SGLang build and SGLang unit/functional test shards are skipped by default through the SKIP_SGLANG workflow setting.
  • Set SKIP_SGLANG=false to build SGLang and run the SGLang shards.
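
The effect of the SKIP_SGLANG default can be sketched as a filter over shard names. The helper select_shards and the name-matching rule are assumptions for illustration only; the workflow may select shards differently.

```python
# Sketch only: select_shards and its name-matching rule are hypothetical.


def select_shards(shards: list[str], skip_sglang: bool = True) -> list[str]:
    """Drop shards whose name mentions SGLang when skip_sglang is set."""
    if not skip_sglang:
        return list(shards)
    return [s for s in shards if "sglang" not in s.lower()]


shards = ["L0_Unit_Tests_Vllm", "L0_Unit_Tests_Sglang", "L0_Unit_Tests_Mcore"]

if __name__ == "__main__":
    print(select_shards(shards))                     # SGLang shard dropped by default
    print(select_shards(shards, skip_sglang=False))  # all shards kept
```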

Test plan

  • Verify the L0 unit shard matrix with CI:L0 or higher.
  • Verify the L1 functional shard matrix with CI:L1.
  • Verify CI:Lfast mode still applies FAST exclusions correctly.
  • Verify the test approval queue gates PR CICD runs and respects the configured concurrency limits.
  • Verify coverage artifacts upload and combine correctly across the new shard names.

Restructure unit test CI from 3 monolithic shards (Generation, Policy,
Other) into 9 targeted shards split by extra/marker. Each extra-specific
shard (mcore, automodel, vllm, sglang, nemo_gym) runs a single
--*-only flag across all unit tests, while domain shards (models,
environments, algorithms, other) run only base (unmarked) tests.

This eliminates the 5-6 sequential pytest invocations per shard,
reduces the bottleneck from 90 min (Policy) to ~30 min per shard,
and makes it clear where new tests should be added.

New shards:
- L0_Unit_Tests_Vllm: base vllm generation + --vllm-only catch-all
- L0_Unit_Tests_Sglang: base sglang files + --sglang-only catch-all
- L0_Unit_Tests_Mcore: --mcore-only catch-all
- L0_Unit_Tests_Automodel: --automodel-only catch-all
- L0_Unit_Tests_Nemo_Gym: --nemo-gym-only catch-all
- L0_Unit_Tests_Models: base model tests (minus generation)
- L0_Unit_Tests_Environments: base environment tests
- L0_Unit_Tests_Algorithms: base algorithm tests
- L0_Unit_Tests_Other: catch-all for remaining base tests + research

Also fixes run_unit.sh to treat pytest exit code 5 (no tests collected)
as success, preventing shard failures when FAST exclusions remove all
tests from a shard.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
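
The routing rule in the commit message above (backend marker first, otherwise domain shard for unmarked tests) can be sketched as follows. The helper route_test is hypothetical, it assumes a unit/<domain>/... layout, and it ignores details such as the Vllm shard also running base generation files.

```python
# Sketch only: route_test is a hypothetical helper that mirrors the routing
# rule described in the commit message; it is not the CI implementation.

BACKEND_SHARDS = {
    "mcore": "L0_Unit_Tests_Mcore",
    "automodel": "L0_Unit_Tests_Automodel",
    "vllm": "L0_Unit_Tests_Vllm",
    "sglang": "L0_Unit_Tests_Sglang",
    "nemo_gym": "L0_Unit_Tests_Nemo_Gym",
}

DOMAIN_SHARDS = {
    "models": "L0_Unit_Tests_Models",
    "environments": "L0_Unit_Tests_Environments",
    "algorithms": "L0_Unit_Tests_Algorithms",
}


def route_test(path: str, markers: set[str]) -> str:
    """Pick a shard: backend marker first, then top-level unit-test directory."""
    for marker, shard in BACKEND_SHARDS.items():
        if marker in markers:
            return shard
    parts = path.split("/")
    # Assumed layout: unit/<domain>/<file>; anything else falls through to Other.
    domain = parts[1] if len(parts) > 2 and parts[0] == "unit" else ""
    return DOMAIN_SHARDS.get(domain, "L0_Unit_Tests_Other")


if __name__ == "__main__":
    # The algorithms file name below is made up for illustration.
    print(route_test("unit/models/policy/test_megatron_worker.py", {"mcore"}))
    print(route_test("unit/algorithms/test_example.py", set()))
    print(route_test("unit/test_recipes_and_test_suites.py", set()))
```

Because marked tests are claimed by a backend shard before the path is consulted, each test lands in exactly one shard under this rule.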
@chtruong814 chtruong814 requested a review from a team as a code owner April 26, 2026 16:25
@copy-pr-bot

copy-pr-bot Bot commented Apr 26, 2026


Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the CI (Relating to CI) label Apr 26, 2026
@chtruong814 chtruong814 added CI:L1 (Run doctests, unit tests, and functional tests) and CI:L0 (Run doctests and unit tests) and removed CI:L1 (Run doctests, unit tests, and functional tests) labels Apr 26, 2026
@chtruong814

Contributor Author

/ok to test

chtruong814 and others added 3 commits April 26, 2026 19:58
The truncated field depends on exact generation output from the tiny
model, which is not reproducible across runs. Instead of comparing
exact bool values, verify that each value is a bool type.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
The Mcore shard (50 min) and Automodel shard (38 min) are bottlenecked
by heavy policy worker tests (test_megatron_worker.py and
test_dtensor_worker*.py). Split each into two shards:

- L0_Unit_Tests_Mcore: mcore tests excluding unit/models/policy/ (~15 min)
- L0_Unit_Tests_Mcore_Policy: mcore tests from unit/models/policy/ only (~30 min)
- L0_Unit_Tests_Automodel: automodel tests excluding unit/models/policy/ (~10 min)
- L0_Unit_Tests_Automodel_Policy: automodel tests from unit/models/policy/ only (~28 min)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Split L0_Unit_Tests_Other into three shards:
- L0_Unit_Tests_Data: data pipeline tests (datasets, processing, message utils)
- L0_Unit_Tests_Distributed: distributed infra tests (worker groups, virtual cluster, logprob)
- L0_Unit_Tests_Other: catch-all for remaining (experience, utils, tools, evals, rewards, root tests)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
@chtruong814

Contributor Author

/ok to test

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
@chtruong814

Contributor Author

/ok to test

chtruong814 and others added 4 commits April 27, 2026 07:56
The qwen2 parametrizations in test_megatron_policy_training,
test_megatron_policy_logprobs, and test_megatron_policy_topk_logits
are redundant — the assertions are model-agnostic (no NaN/Inf, correct
shapes, loss decreases) and the Qwen->Megatron converter path is
thoroughly covered by functional tests (grpo_megatron.sh,
dpo_megatron.sh, sft_megatron.sh all use Qwen models).

Removes 14 test instances:
- training: 9 → 7 (dropped 2 qwen2 variants)
- logprobs: 12 → 6 (dropped 6 qwen2 variants)
- topk: 12 → 6 (dropped 6 qwen2 variants)

Estimated savings: ~5-10 minutes on the Mcore_Policy shard.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
…re combos

The training_setup fixture tested 5 model architectures (llama, qwen2,
qwen3, gemma3, nemotron5_h) but the assertions are model-agnostic
(no NaN/Inf, loss decreases, flops tracking). Model compatibility is
covered by functional tests (grpo.sh, grpo_fsdp2.sh, dpo.sh, sft.sh
use Qwen and Gemma models).

Consolidate to llama-only while preserving all feature combinations
(sp, cpu_offload, activation_checkpointing, cp, and their combos).

Reduces from 23 → 10 parametrized test instances.
Logprob_setup left unchanged since it validates numerical correctness
via torch.allclose per architecture.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Guard the truncated field check with a key existence check since the
expected_result dict no longer contains the truncated field.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
The truncated field was incorrectly removed from expected_result in an
earlier commit. It should remain present so _standardize can validate
the field contains bools before popping it from both sides.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
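
The truncated-field handling described in the commits above (check that values are bools, then drop the field from both sides before comparing) might look roughly like this. The helper standardize_pair is a hypothetical stand-in; the real _standardize in the test suite may differ, for example in how it handles tensors.

```python
# Sketch only: standardize_pair is a hypothetical stand-in for the test
# helper mentioned above; the real _standardize may differ.


def standardize_pair(actual: dict, expected: dict) -> tuple[dict, dict]:
    """Validate that 'truncated' holds only bools on both sides, then drop it."""
    actual, expected = dict(actual), dict(expected)  # avoid mutating the inputs
    for name, result in (("actual", actual), ("expected", expected)):
        values = result.pop("truncated")  # raises KeyError if the field is missing
        if not all(isinstance(v, bool) for v in values):
            raise TypeError(f"{name}['truncated'] must contain only bools")
    return actual, expected


if __name__ == "__main__":
    actual = {"text": ["a", "b"], "truncated": [True, False]}
    expected = {"text": ["a", "b"], "truncated": [False, False]}
    a, e = standardize_pair(actual, expected)
    # The remaining fields are compared exactly; the truncated values are not.
    print(a == e)
```

Popping the field with a plain key lookup means a missing truncated entry fails loudly, which matches the intent of keeping the field present in expected_result.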
@chtruong814

Contributor Author

/ok to test

chtruong814 and others added 2 commits April 27, 2026 08:12
Refactor test_megatron_worker.py to use a class-scoped Ray cluster
fixture (TestMegatronTwoGPU) for the parametrized tests, following
the same pattern as test_dtensor_worker.py's TestTwoGPUCluster.

Previously, each parametrized test (training×7, generation×2,
logprobs×6, topk×6 = 21 tests) created and destroyed its own
RayVirtualCluster. Now they share a single class-scoped cluster,
saving ~20 cluster creation/teardown cycles.

Each test still creates and destroys its own Policy for isolation.
Standalone tests (checkpoint, loss_independent, grad_norm, etc.)
remain outside the class since they need custom cluster configs.

Estimated savings: ~5-10 minutes from avoided cluster overhead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
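
The idea of sharing one cluster across a class of tests can be sketched with the standard library alone. This is not the pytest fixture from the commit: unittest's setUpClass is used here as an analog of a class-scoped pytest fixture, and FakeCluster is a made-up stand-in for RayVirtualCluster.

```python
# Sketch only: FakeCluster stands in for RayVirtualCluster, and unittest's
# setUpClass stands in for a class-scoped pytest fixture.
import unittest


class FakeCluster:
    created = 0  # counts how many clusters were ever built

    def __init__(self) -> None:
        FakeCluster.created += 1

    def shutdown(self) -> None:
        pass


class TestSharedCluster(unittest.TestCase):
    @classmethod
    def setUpClass(cls) -> None:
        cls.cluster = FakeCluster()  # built once for the whole class

    @classmethod
    def tearDownClass(cls) -> None:
        cls.cluster.shutdown()  # torn down once after the last test

    def test_training(self) -> None:
        # A real test would create and destroy its own Policy here.
        self.assertIsNotNone(self.cluster)

    def test_logprobs(self) -> None:
        self.assertIsNotNone(self.cluster)


def run_suite() -> unittest.TestResult:
    """Run the class above and return the collected result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestSharedCluster)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

With per-test setup the cluster would be built once per test method; with class-level setup it is built once per class, which is where the avoided creation and teardown cycles come from.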
…ests"

This reverts commit 1ffeb76.

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
@chtruong814

Contributor Author

/ok to test

@chtruong814

Contributor Author

/ok to test 3a83519

kajalj22
kajalj22 previously approved these changes May 22, 2026
Comment thread tests/unit/test_recipes_and_test_suites.py
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
@chtruong814

Contributor Author

/ok to test 0863f96

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
@chtruong814

Contributor Author

/ok to test 766d6f3

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
@chtruong814

Contributor Author

/ok to test

Comment thread tests/unit/models/policy/test_dtensor_worker.py
Comment thread tests/unit/models/generation/test_vllm_generation.py
Comment thread tests/unit/experience/test_rollouts.py
@chtruong814

Contributor Author

/ok to test

kajalj22
kajalj22 previously approved these changes May 27, 2026
Comment thread .github/workflows/cicd-main.yml
Comment thread tests/unit/models/generation/test_vllm_generation.py
Comment thread tests/run_unit.sh
Comment thread tests/unit/models/generation/test_vllm_generation.py
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
…ong/shard-tests

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
…ong/shard-tests

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
@chtruong814

Contributor Author

/ok to test

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
@chtruong814

Contributor Author

/ok to test

Labels

CI:L1 (Run doctests, unit tests, and functional tests)
CI (Relating to CI)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Unit test shard selection

3 participants